Li S, Xiao T, Li H, et al. Person search with natural language description[C]//Proc. CVPR. 2017.
1. Overview
1.1. Motivation
- existing methods mainly focus on searching for persons with image-based or attribute-based queries, which limits practical usage
- no person dataset or benchmark with textual descriptions was available
This paper studies person search with natural language description:
- proposed a Recurrent Neural Network with Gated Neural Attention mechanism (GNA-RNN)
- collected the CUHK Person Description Dataset (CUHK-PEDES)
1.2. Image-Based Query
- person re-identification, which requires at least one photo of the queried person to be given
1.3. Attribute-Based Query
- pre-defined semantic attributes have limited capability in describing persons' appearance, and labeling an exhaustive set of attributes is expensive
1.4. Contribution
- person search with language is more practical for real-world use
- investigated different solutions: image captioning, VQA, and visual-semantic embedding
- proposed GNA-RNN
1.5. Related Work
1.5.1. Language Dataset for Vision
- Flickr8K, Flickr30K
- MS-COCO Caption
- Visual Genome
- Caltech-UCSD Birds
- Oxford-102 flowers
1.5.2. Deep Language Models for Vision
- Image Caption. NeuralTalk
- VQA. Stacked Attention Network
- Visual-Semantic Embedding
1.6. CUHK-PEDES Dataset
- 40,206 images of 13,003 persons collected from five existing person re-identification datasets (CUHK03, Market-1501, SSM, VIPeR, CUHK01)
- 80,412 sentences for the 40,206 images (2 sentences/image), with details about appearance, actions, poses, and interactions with other objects
- analysis of high-frequency words
1.6.1. User Study
- Language vs. Attribute. Language descriptions are much more precise and effective than attributes in describing persons (top-1: 58.7% vs. 33.3%; top-5: 92.0% vs. 74.7%).
- Sentence Number and Length. Three sentences achieve the highest retrieval accuracy; the longer the sentences, the easier it is for users to retrieve the correct images.
- Word Types. Nouns provide the most information, followed by adjectives, while verbs carry the least information.
2. GNA-RNN
- the key is to build word-image relations: given each word, search related image regions to determine whether the word (with its context) fits the image
- the confidences of all relations should be weighted and then aggregated to generate the final sentence-image affinity
2.1. Visual Units
- Input. resize to 256x256
- Output. 512 visual units
- pre-trained on the dataset for person classification based on person IDs
- during joint training, only cls-fc1 and cls-fc2 are updated (see the sketch below)
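A minimal PyTorch-style sketch of this visual branch, assuming a VGG-16 backbone and hypothetical layer names/sizes (id_head, the 4096-d hidden width); the paper's exact architecture may differ:

```python
import torch
import torch.nn as nn
import torchvision

class VisualUnits(nn.Module):
    """Sketch of the visual branch: a CNN backbone followed by two FC layers
    (cls-fc1, cls-fc2) whose 512 outputs serve as the visual units."""
    def __init__(self, num_units=512, num_ids=11003, freeze_backbone=True):
        super().__init__()
        self.backbone = torchvision.models.vgg16(weights=None).features
        if freeze_backbone:
            # during joint training only cls-fc1 / cls-fc2 are updated
            for p in self.backbone.parameters():
                p.requires_grad = False
        self.pool = nn.AdaptiveAvgPool2d((7, 7))
        self.cls_fc1 = nn.Linear(512 * 7 * 7, 4096)    # hypothetical hidden size
        self.cls_fc2 = nn.Linear(4096, num_units)
        # classifier head used only for the person-ID pre-training stage
        self.id_head = nn.Linear(num_units, num_ids)

    def forward(self, images):                          # images: (B, 3, 256, 256)
        feat = self.pool(self.backbone(images)).flatten(1)
        units = torch.relu(self.cls_fc2(torch.relu(self.cls_fc1(feat))))
        return units                                    # (B, 512) visual unit activations
```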
2.2. Attention over Visual Units
- words are encoded into K-dimensional one-hot vectors, where K is the vocabulary size
- each one-hot vector is embedded and concatenated with the image features
- through LSTM → FCs → softmax, unit-level attention is generated at each word
- the per-word response is the attention-weighted summation over the visual units
- responses are summed over all T words (see the sketch after 2.2.1)
2.2.1. LSTM
- h. tanh activation for the hidden state
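A minimal sketch of this unit-level attention path, assuming hypothetical layer names and sizes (att_fc1/att_fc2, embedding and hidden widths); it follows the description above but is not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class UnitAttention(nn.Module):
    """Sketch: at each word, the LSTM state drives a softmax attention over the
    512 visual units; the per-word response is the attention-weighted sum of
    unit activations, summed over all T words (word-level gates come in 2.3)."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, num_units=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)      # one-hot -> dense embedding
        self.lstm = nn.LSTM(embed_dim + num_units, hidden_dim, batch_first=True)
        self.att_fc1 = nn.Linear(hidden_dim, 512)             # hypothetical FC names
        self.att_fc2 = nn.Linear(512, num_units)

    def forward(self, word_ids, visual_units):
        # word_ids: (B, T) token indices; visual_units: (B, 512) from the visual branch
        T = word_ids.size(1)
        words = self.embed(word_ids)                                  # (B, T, E)
        vis = visual_units.unsqueeze(1).expand(-1, T, -1)             # concat image features at every step
        h, _ = self.lstm(torch.cat([words, vis], dim=-1))             # (B, T, H)
        att = torch.softmax(self.att_fc2(torch.relu(self.att_fc1(h))), dim=-1)  # (B, T, 512)
        per_word = (att * visual_units.unsqueeze(1)).sum(-1)          # (B, T) attention-weighted responses
        return per_word.sum(-1)                                       # (B,) summed over T words, before gating
```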
2.3. Word-Level Gates for Visual Units
- different words carry significantly different amounts of information for obtaining language-image affinity ("white" should be more important than "this")
- unit-level attention cannot reflect such differences, since the softmax normalizes the attention to sum to 1 at every word
- a word-level scalar gate is therefore learned at each word (see the formula sketch below)
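Putting 2.2 and 2.3 together, the sentence-image affinity can be sketched as follows (notation is mine, not the paper's: v_n is the n-th visual unit activation, A_t(n) the unit-level attention at word t, g_t the word-level gate):

```latex
a_t = g_t \sum_{n=1}^{512} A_t(n)\, v_n, \qquad \mathrm{Affinity}(S, I) = \sum_{t=1}^{T} a_t
```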
2.4. Loss Function
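- affinities of matched and mismatched sentence-image pairs are supervised with a cross-entropy loss (negative pairs sampled as listed in 2.5; see the training sketch at the end of 2.5)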
2.5. Details
- SGD
- positive:negative=1:3
- batch size 128
- all FC layers have 512 units except gate-fc1
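A hedged sketch of one training step under these settings, assuming the model returns an affinity logit for a batch of sentence-image pairs (the hypothetical signature `model(images, sentences)` and the exact negative-sampling/loss details may differ from the paper):

```python
import torch
import torch.nn.functional as F

def training_step(model, images, sentences, optimizer, neg_ratio=3):
    """One SGD step: each matched sentence-image pair is joined by `neg_ratio`
    mismatched pairs (positive:negative = 1:3), and the affinity is trained
    with a binary cross-entropy loss."""
    pos = model(images, sentences)                     # (B,) affinity logits for matched pairs
    negs = []
    for _ in range(neg_ratio):
        perm = torch.randperm(images.size(0))          # shuffle images to build mismatched pairs
        negs.append(model(images[perm], sentences))    # may rarely hit the same ID; ignored here
    scores = torch.cat([pos] + negs)
    labels = torch.cat([torch.ones_like(pos)] + [torch.zeros_like(n) for n in negs])
    loss = F.binary_cross_entropy_with_logits(scores, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# usage (hypothetical): optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# with batches of 128 sentence-image pairs, as noted above
```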
3. Experiments
3.1. Dataset
- training set. 11,003 persons; 34,054 images; 68,108 sentence descriptions
- testing set. 3,074 images of 1,000 persons
- validation set. 3,078 images of 1,000 persons
3.2. Comparison
- LSTM might have difficulty encoding complex sentences into a single feature vector
- word-by-word processing and comparison might be more suitable for the person search problem
- RNNs are more suitable for processing natural language data
3.3. Ablation Study
- the initial training stage strongly affects the final performance
3.4. The Number of Visual Unit
- more units might over-fit the dataset